1 Packages

2 Overview

This is a fast-paced course that covers a lot of material. There will be a large number of references. You may need to do your own research to fill in the gaps between lectures and homework/projects. It is impossible to learn data science without getting your hands dirty. Please budget your time evenly. A last-minute work ethic will not work for this course.

Homework in this course is different from the typical homework assignment. Most assignments are built on real case studies. While you will be applying methods covered in lectures, you will also find that extra teaching materials appear here. The focus will always be on the goals of the study, the usefulness of the data gathered, and the limitations of any conclusions you may draw. Always try to challenge your data analysis in a critical way. Frequently, there are no unique solutions.

Case studies in each homework can be listed as your data science projects (e.g. on your CV) where you see fit.

2.1 Objectives

  • Get familiar with RStudio and R Markdown
  • Hands-on R
  • Learn data science essentials
    • gather data
    • clean data
    • summarize data
    • display data
    • conclusion
  • Packages
    • dplyr
    • ggplot2

2.2 Instructions

  • Homework assignments can be done in a group consisting of up to three members. Please find your group members as soon as possible and register your group on our Canvas site.

  • All work submitted should be completed in the R Markdown format. You can find a cheat sheet for R Markdown here. For those who have never used it before, we urge you to start this homework as soon as possible.

  • Submit the following files, one submission for each group: (1) Rmd file, (2) a compiled HTML or PDF version, and (3) all necessary data files if different from our source data. You may directly edit this .rmd file to add your answers. If you intend to work on the problems separately within your group, compile your answers into one Rmd file before submitting. We encourage you to at least attempt each problem by yourself before working with your teammates. Additionally, ensure that you can ‘knit’ or compile your Rmd file. It is also likely that you need to configure RStudio to properly convert files to PDF. These instructions might be helpful.

  • In general, be as concise as possible while giving a fully complete answer to each question. All necessary datasets are available in this homework folder on Canvas. Make sure to document your code with comments (written on separate lines in a code chunk using a hashtag # before the comment) so the teaching fellows can follow along. R Markdown is particularly useful because it follows a ‘stream of consciousness’ approach: as you write code in a code chunk, make sure to explain what you are doing outside of the chunk.

  • A few good or solicited submissions will be used as sample solutions. When those are released, make sure to compare your answers and understand the solutions.

2.3 Review materials

  • Study Basic R Tutorial
  • Study Advanced R Tutorial (to include dplyr and ggplot)
  • Study lecture 1: Data Acquisition and EDA

3 Case study 1: Audience Size

How successful is the Wharton Talk Show Business Radio Powered by the Wharton School?

Background: Have you ever listened to SiriusXM? Do you know there is a talk show run by Wharton professors on Sirius Radio? Wharton launched a talk show called Business Radio Powered by the Wharton School through the Sirius Radio station in January of 2014. Within a short period of time the general reaction seemed to be overwhelmingly positive. To find out the audience size for the show, we designed a survey and collected a data set via MTURK in May of 2014. Our goal was to estimate the audience size. There were 51.6 million Sirius Radio listeners then. One approach is to estimate the proportion of Wharton listeners among Sirius listeners, \(p\), so that we arrive at an audience size estimate of approximately 51.6 million times \(p\).

To do so, we launched a survey via Amazon Mechanical Turk (MTurk) on May 24, 2014 at an offered price of $0.10 for each answered survey. We set it to be run for 6 days with a target maximum sample size of 2000 as our goal. Most of the observations came in within the first two days. The main questions of interest are “Have you ever listened to Sirius Radio” and “Have you ever listened to Sirius Business Radio by Wharton?”. A few demographic features used as control variables were also collected; these include Gender, Age and Household Income.

We requested that only people in the United States answer the questions. Each person could fill in the questionnaire only once, to avoid duplicates. Aside from these restrictions, we opened the survey to everyone on MTurk in the hope that the sample would be more randomly chosen.

The raw data is stored as Survey_results_final.csv on Canvas.

3.1 Data preparation

3.1.1 We need to clean and select only the variables of interest.

  • Select only the variables Age, Gender, Education Level, Household Income in 2013, Sirius Listener?, Wharton Listener? and Time used to finish the survey.

  • Change the variable names to be “age”, “gender”, “education”, “income”, “sirius”, “wharton”, “worktime”.
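The two steps above can be sketched in base R as follows. This is a minimal illustration on a toy data frame: the raw column headers used here (for example `WorkTimeInSeconds` and `HITId`) are hypothetical stand-ins, and the real headers in Survey_results_final.csv may differ.

```r
# Toy stand-in for the raw survey export; the column headers are hypothetical
raw <- data.frame(
  Age = c("25", "31"),
  Gender = c("Male", "Female"),
  Education = c("Some college", "Bachelor's degree"),
  Income = c("$15,000 - $30,000", "$50,000 - $75,000"),
  Sirius = c("Yes", "No"),
  Wharton = c("No", "No"),
  WorkTimeInSeconds = c(20, 35),
  HITId = c("a1", "b2"),  # an extra column that is not needed
  stringsAsFactors = FALSE
)

# Columns to keep, listed in the same order as the new names
keep <- c("Age", "Gender", "Education", "Income",
          "Sirius", "Wharton", "WorkTimeInSeconds")
new_names <- c("age", "gender", "education", "income",
               "sirius", "wharton", "worktime")

survey <- raw[, keep]        # select only the variables of interest
names(survey) <- new_names   # rename them
str(survey)
```

Keeping `keep` and `new_names` side by side makes it easy to check that each new name lines up with the intended column.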

3.1.2 Handle missing/wrongly filled values of the selected variables

As with most real-world data based on user input, the data is incomplete: it has missing values and incorrect responses. There is no general rule for dealing with these problems beyond “use common sense.” In every case, explain what the problems were and how you addressed them. Be sure to explain the rationale for your chosen methods of handling issues with the data. Do not use Excel for this, however tempting it might be.

Tip: Reflect on the reasons for which data could be wrong or missing. How would you address each case? For this homework, if you are trying to predict missing values with regression, you are definitely overthinking. Keep it simple.

3.1.2.1 Summary of missing values

Before this point, we thought there were just blanks and NAs in the data, but we discovered that some fields were left unchanged at the default “select one” (at least in the education column) and that the age column contained incorrect values. Here we tabulate these missing and incorrect entries.

Breakdown of Missing and Incorrect Values

| | age | gender | education | income | sirius | wharton | worktime |
|---|---|---|---|---|---|---|---|
| Missing or NA Values | 1 | 6 | 0 | 6 | 5 | 4 | 0 |
| Incorrect Values | 5 | 0 | 19 | 0 | 0 | 0 | 0 |

  • Within the age variable, 1 person did not respond, 2 people entered ages that did not make sense in the context of the survey (4 and 223), and 3 people entered their age as text (“eighteen (18)”, “27`”, and “female”). We opted to remove the missing response and the incorrect ages (4, 223, and “female”); however, we changed the values “eighteen (18)” and “27`” to their numeric values.

  • Within the education variable, 19 respondents left the field at ‘select one’, and we opted to remove these.

  • Within the gender, income, wharton, and sirius variables, we removed all missing values.

  • We did not remove any values from the worktime variable since it did not contain any missing or incorrect values.

Although we removed 44 values in total (22 missing and 22 incorrect), only 37 surveys were removed because some respondents had problems in more than one variable.
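The cleaning steps described above could look roughly like the following sketch. The data frame is a small made-up example containing the same kinds of problems; the 18 to 100 age window is an assumption chosen for illustration, not a rule stated in the assignment.

```r
# Toy data with the kinds of problems described above (values are illustrative)
survey <- data.frame(
  age = c("25", "eighteen (18)", "27`", "223", "4", "female", NA, "40"),
  education = c("Some college", "High school", "Bachelor's", "Bachelor's",
                "High school", "Some college", "Some college", "select one"),
  gender = c("Male", "Female", "Male", "Male", "Female", "Male", "", "Female"),
  stringsAsFactors = FALSE
)

# Treat blank strings as missing, then count missing values per column
survey$gender[survey$gender %in% ""] <- NA
colSums(is.na(survey))

# Recode the two recoverable ages, then convert to numeric;
# text that cannot be parsed (e.g. "female") becomes NA
survey$age[survey$age %in% "eighteen (18)"] <- "18"
survey$age[survey$age %in% "27`"] <- "27"
survey$age <- suppressWarnings(as.numeric(survey$age))

# Flag implausible ages and unanswered dropdowns as missing
# (the 18 to 100 window is an assumption for this sketch)
survey$age[!is.na(survey$age) & (survey$age < 18 | survey$age > 100)] <- NA
survey$education[survey$education %in% "select one"] <- NA

# Drop every row that still has a missing value in any column
clean <- survey[complete.cases(survey), ]
nrow(clean)
```

Row 7 of the toy data has two problems (missing age and blank gender) but is dropped only once, which is why the number of removed surveys can be smaller than the number of removed values.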

3.2 Brief summary

Write a brief report to summarize all the variables collected. Include both summary statistics (including sample size) and graphical displays such as histograms or bar charts where appropriate. Comment on what you have found from this sample. (For example, it is interesting to think about why one would work for a job that pays only 10 cents per survey. Who are those survey workers? The answer may be interesting even if it does not directly relate to our goal.)

This study received 1,764 surveys. We removed 37 of those after conducting quality control, resulting in a total of 1,727 surveys. In general, the participants are mostly young, in their early to late twenties. Most are male. The majority have at least attended some college and make below $50,000. We describe the population of participants in greater detail and by several demographic categories below.

3.2.1 Age

  • Age ranges from 18 to 76.

  • The distribution is right skewed, with most participants in their early to late twenties.

We can see the distribution of ages with a histogram and examine how this distribution varies by other variables.

3.2.2 Gender

  • There are 729 (42%) Females and 998 (58%) Males in the survey pool.

3.2.3 Education

Most participants have completed some college education.

Survey Demographics by Education

| Education | Counts | Percentage |
|---|---|---|
| Other | 2 | 0.116 |
| Less than 12 years; no high school diploma | 10 | 0.579 |
| High school graduate (or equivalent) | 190 | 11.002 |
| Some college, no diploma; or Associate’s degree | 737 | 42.675 |
| Bachelor’s degree or other 4-year degree | 611 | 35.379 |
| Graduate or professional degree | 177 | 10.249 |

3.2.4 Income

Most participants make below $50,000.

Survey Demographics by Income

| Income | Counts | Percentage |
|---|---|---|
| Less than $15,000 | 206 | 11.93 |
| $15,000 - $30,000 | 360 | 20.84 |
| $30,000 - $50,000 | 420 | 24.32 |
| $50,000 - $75,000 | 371 | 21.48 |
| $75,000 - $150,000 | 326 | 18.88 |
| Above $150,000 | 44 | 2.55 |
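A count-and-percentage table like the ones above can be built with `table()` and `prop.table()`. The sketch below uses a handful of made-up responses rather than the survey data, and the bracket labels are only illustrative.

```r
# Illustrative income responses; not the survey data
income <- c("Less than $15,000", "$15,000 - $30,000", "$15,000 - $30,000",
            "$30,000 - $50,000", "Less than $15,000", "$15,000 - $30,000")

# Order the brackets explicitly so the table does not sort alphabetically
levels_in_order <- c("Less than $15,000", "$15,000 - $30,000", "$30,000 - $50,000")
income <- factor(income, levels = levels_in_order)

counts <- table(income)
summary_tbl <- data.frame(
  Bracket = names(counts),
  Counts = as.vector(counts),
  Percentage = round(100 * as.vector(prop.table(counts)), 2)
)
summary_tbl
```

Setting the factor levels by hand matters for ordered categories such as income or education, where alphabetical order would scramble the brackets.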

3.2.5 Sirius Listeners

  • 77% of participants listen to Sirius Radio

Business Radio Powered by the Wharton School Listeners

  • 4% of participants listen to Business Radio Powered by the Wharton School

3.2.6 Worktime

  • average worktime 22.5 seconds
  • ranged from 8-108 seconds

3.3 Sample properties

The population from which the sample is drawn determines where the results of our analysis can be applied or generalized. We include some basic demographic information for the purpose of identifying sample bias, if any exists. Combine our data and the general population distribution in age, gender and income to try to characterize our sample on hand.

  1. Does this sample appear to be a random sample from the general population of the USA?

  2. Does this sample appear to be a random sample from the MTURK population?

Note: You cannot provide evidence by simply looking at our data here. For example, you need to find the distribution of education for our age group in the US to see whether the two groups match in distribution. You may need to gather some background information about the MTURK population to get a rough sense of whether this particular sample seems to be a random sample from it. Please do not spend too much time gathering evidence.

We use several datasets from the US Census Bureau:

  • nc-est2019-sr11h: Annual Estimates of the Resident Population by Sex, Race, and Hispanic Origin for the United States: April 1, 2010 to July 1, 2019

  • nc-est2019-agesex: Annual Estimates of the Resident Population for Selected Age Groups by Sex for the United States: April 1, 2010 to July 1, 2019

  • data/table-1-01.xlsx: Table 1. Educational Attainment of the Population 18 Years and Over, by Age, Sex, Race, and Hispanic Origin: 2014

  • data/hinc01R_1.xls: HINC-01. Selected Characteristics of Households, by Total Money Income in 2013. (income data comes from 2013)

3.3.1 Gender

Is the distribution of gender from the survey participants a random sample from the general population of the USA?

Additionally, we use data from this paper, which sampled a total of 2,026 U.S. adults from the MTURK database in mid-December 2019, to estimate the demographic information of participants in MTURK.

3.3.2 Age

Is the distribution of ages from the survey participants a random sample from the general population of the USA?

Education

Is the distribution of education from the survey participants a random sample from the general population of the USA?

3.3.3 Income

Is the distribution of income from the survey participants a random sample from the general population of the USA?

  1. Does this sample appear to be a random sample from the general population of the USA?
  • This sample differs slightly from the population of the USA in some demographic information, but appears to be a random sample. We would have to do more analysis beyond comparing percentages to quantify the magnitude of the difference.
  • The participants completely exclude anyone younger than 18, and the participants included are younger than the US population.
  • There is a higher percentage of males in this sample (58%), while the US has closer to 50%.
  • The participants are more educated, with a larger percentage of participants with some college experience, and a larger percentage with at least a bachelor’s degree.
  • The income distribution was within a few percentage points of the U.S. population for each income bracket except people making more than $150,000 per year.
  2. Does this sample appear to be a random sample from the MTURK population?
  • This sample appears to be a random sample from the MTURK population.
  • In both populations, participants are younger than the US population, with the age distributions varying by a few percentage points.
  • There is a higher percentage of males in this sample (58%), while the MTURK population has closer to 43%.
  • In both populations, the participants are more educated, with a larger percentage of participants with some college experience, and a larger percentage with at least a bachelor’s degree.
  • The income distribution was within a few percentage points of the MTURK population for each income bracket except people making more than $150,000 per year.
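One way to go beyond eyeballing percentages is a chi-square goodness-of-fit test, sketched here for gender only. The observed counts are the ones reported in the Gender summary (729 females, 998 males); the even reference split is an assumption standing in for the "closer to 50%" figure and is not an exact census value.

```r
# Counts reported in the Gender summary
observed <- c(Female = 729, Male = 998)

# Assumed reference shares: an even split, a rough stand-in for the
# US population (not an exact census value)
reference <- c(Female = 0.5, Male = 0.5)

# Side-by-side percentages
round(100 * rbind(sample = observed / sum(observed), reference = reference), 1)

# Goodness-of-fit test: could the observed counts plausibly come from
# a population with the reference shares?
fit <- chisq.test(x = observed, p = reference)
fit$statistic
fit$p.value
```

A small p-value would suggest the gender split is unlikely under simple random sampling from the reference population. The same call could be repeated with age, education, or income brackets once matching reference shares are available.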

3.4 Final estimate

Give a final estimate of the Wharton audience size in January 2014. Assume that the sample is a random sample of the MTURK population, and that the proportion of Wharton listeners vs. Sirius listeners in the general population is the same as that in the MTURK population. Write a brief executive summary to summarize your findings and how you came to that conclusion.

To be specific, you should include:

  1. Goal of the study
  2. Method used: data gathering, estimation methods
  3. Findings
  4. Limitations of the study.

Wharton launched a talk show called Business Radio Powered by the Wharton School through the Sirius Radio station in January of 2014. To find out the audience size for the show, a survey was designed and a data set was collected via MTURK in May of 2014. The goal was to estimate the audience size.

To do so, a survey was launched via Amazon Mechanical Turk (MTurk) on May 24, 2014 at an offered price of $0.10 for each answered survey. It was set to run for 6 days with a target maximum sample size of 2000. Most of the observations came in within the first two days. The main questions of interest are “Have you ever listened to Sirius Radio” and “Have you ever listened to Sirius Business Radio by Wharton?”. A few demographic features used as control variables were also collected; these include Gender, Age and Household Income.

It was requested that only people in the United States answer the questions. Each person could fill in the questionnaire only once, to avoid duplicates. Aside from these restrictions, the survey was open to everyone on MTurk in the hope that the sample would be more randomly chosen.

We assume that the sample is a random sample of the MTURK population, and that the proportion of Wharton listeners vs. Sirius listeners in the general population is the same as that in the MTURK population.

There were 51.6 million Sirius Radio listeners then. One approach is to estimate the proportion of Wharton listeners among Sirius listeners, \(p\), so that we arrive at an audience size estimate of approximately 51.6 million times \(p\). Using this method, we estimated the Wharton audience size in January 2014 to be 2,585,789, with an interval of [1,982,330, 3,189,248].
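The calculation behind an estimate of this kind can be sketched as follows. The respondent counts below are hypothetical round numbers chosen for illustration, not the survey's actual counts, and the normal-approximation 95% interval is an assumption about how such an interval might be computed; the report does not state its method.

```r
# Hypothetical counts for illustration only (not the survey's actual counts)
n_sirius  <- 1000       # respondents who have listened to Sirius
n_wharton <- 50         # of those, respondents who have listened to the Wharton show
sirius_total <- 51.6e6  # Sirius listeners, as given in the text

p_hat <- n_wharton / n_sirius
se    <- sqrt(p_hat * (1 - p_hat) / n_sirius)  # normal-approximation standard error

# Point estimate and approximate 95% interval, scaled to audience size
estimate <- sirius_total * p_hat
interval <- sirius_total * (p_hat + c(-1, 1) * qnorm(0.975) * se)

round(estimate)
round(interval)
```

The interval only reflects sampling variability in \(p\). It does not account for the stated assumptions, such as the MTURK proportion carrying over to the general population.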

A major limitation of this study is that the population sampled from MTURK is not a random sample of the US population and differs quite significantly.

3.5 New task

Now suppose you are asked to design a study to estimate the audience size of Wharton Business Radio Show as of today: You are given a budget of $1000. You need to present your findings in two months.

3.5.1 Write a proposal for this study which includes:

  1. Method proposed to estimate the audience size.
  • Since we know that most participants have some sort of college experience, we could poll audiences within and adjacent to those spaces with a Qualtrics survey sent across the country. The incentive to participate would be cash prizes raffled off to N participants and/or a one-year subscription to SiriusXM radio.
  2. What data should be collected and where it should be sourced from.
  • The data collected should be Age, Education, Gender, Location, Experience with Wharton Radio, and How they found out about Wharton Radio.

Please fill in the Google Form to list the platform where your surveys will be launched and collected here.

A good proposal will give an accurate estimation with the least amount of money used.

4 Case study 2: Women in Science

Are women underrepresented in science in general? How does gender relate to the type of educational degree pursued? Does the number of higher degrees increase over the years? In an attempt to answer these questions, we assembled a data set (WomenData_06_16.xlsx) from NSF about various degrees granted in the U.S. from 2006 to 2016. It contains the following variables: Field (Non-science-engineering (Non-S&E) and sciences (Computer sciences, Mathematics and statistics, etc.)), Degree (BS, MS, PhD), Sex (M, F), Number of degrees granted, and Year.

Our goal is to answer the above questions only through EDA (Exploratory Data Analysis) without formal testing. We have provided sample R code in the appendix to help you if needed.

4.1 Data preparation

4.1.1 Understand and clean the data

Notice the data came in as an Excel file. We need to use the package readxl and the function read_excel() to read the data WomenData_06_16.xlsx into R.

  • Read the data into R.

  • Clean the names of the variables. (Change the variable names to Field, Degree, Sex, Year and Number.)

  • Set the variable types properly.

  • Any missing values?

We can count the number of NA values, if there are any. In this dataset, there are no missing values.
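A minimal sketch of the preparation steps follows. To keep it self-contained, a toy data frame stands in for the result of `readxl::read_excel("WomenData_06_16.xlsx")`; the raw column headers and their order are hypothetical, so the renaming line should be checked against the real file.

```r
# Toy stand-in for the Excel import; raw headers and values are hypothetical
women_raw <- data.frame(
  `Field and sex` = c("Computer sciences", "Computer sciences", "Non-S&E", "Non-S&E"),
  Degree = c("BS", "BS", "PhD", "PhD"),
  Sex = c("Male", "Female", "Male", "Female"),
  `Fiscal year` = c(2015, 2015, 2016, 2016),
  `Degrees Awarded` = c(100, 25, 40, 60),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Rename by position (assumes the columns arrive in this order)
women <- women_raw
names(women) <- c("Field", "Degree", "Sex", "Year", "Number")

# Categorical variables as factors, numeric variables as numbers
women$Field  <- factor(women$Field)
women$Degree <- factor(women$Degree, levels = c("BS", "MS", "PhD"))
women$Sex    <- factor(women$Sex)
women$Year   <- as.integer(women$Year)
women$Number <- as.numeric(women$Number)

# Missing values per column
colSums(is.na(women))
```

Whether Year is better treated as an integer or a factor depends on the plot or summary; an integer keeps time on a continuous axis.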

4.2 Write a summary describing the data set provided here.

  • How many fields are there in this data?

We can count the number of unique entries in the Field variable.

There are 10 possible fields: Agricultural sciences; Biological sciences; Computer sciences; Earth, atmospheric, and ocean sciences; Engineering; Mathematics and statistics; Non-S&E; Physical sciences; Psychology; and Social sciences.

  • What are the degree types?

There are 3 degree types: BS, MS, and PhD

  • How many years of statistics are being reported here?

There are statistics for 11 years from 2006 to 2016

  • Other Variables

Other variables included in this study are Sex, Male or Female, and Number, the number of degrees awarded.

4.3 BS degrees in 2015

Is there evidence that more males are in science-related fields vs Non-S&E? Provide summary statistics and a plot which shows the number of people by gender and by field. Write a brief summary to describe your findings.

Is there evidence that more males are in science-related fields vs Non-S&E? –> There are more people overall in Non-S&E fields, and among them there are more women. However, the plots show that there are more men in the S&E fields.
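The summary statistics for this question could be computed along the following lines. The counts are made up for illustration, and collapsing every field other than Non-S&E into a single "S&E" group is one possible way to frame the comparison.

```r
# Toy BS-degree counts for a single year; the numbers are made up
bs <- data.frame(
  Field  = c("Computer sciences", "Computer sciences", "Psychology", "Psychology",
             "Non-S&E", "Non-S&E"),
  Sex    = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Number = c(80, 20, 30, 70, 400, 600)
)

# Collapse the individual science fields into one S&E group
bs$Group <- ifelse(bs$Field == "Non-S&E", "Non-S&E", "S&E")

# Total degrees by group and sex
totals <- aggregate(Number ~ Group + Sex, data = bs, FUN = sum)
totals

# Share of each sex within each group (rows sum to 1)
wide <- xtabs(Number ~ Group + Sex, data = bs)
round(prop.table(wide, margin = 1), 3)
```

Looking at within-group shares as well as raw totals helps separate "more people overall" from "a different gender mix".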

4.4 EDA bringing type of degree, field and gender in 2015

Describe the number of people by type of degree, field, and gender. Do you see any evidence of gender effects over different types of degrees? Again, provide graphs to summarize your findings.

The proportions in Non-S&E fields are relatively similar across degree types, but there is more variability in the S&E degrees. Males and females obtain about the same number of science BS degrees (females slightly more), but males obtain more science MS and PhD degrees.

4.5 EDA bringing in all variables

In this last portion of the EDA, we ask you to provide evidence numerically and graphically: Does the number of degrees change by gender, field, and time?

Does the number of degrees change by gender, field, and time? –> Females appear to earn more degrees overall at the BS and MS level, but males earn more PhDs in STEM fields. Males also earn more STEM MS degrees than females.

4.6 Women in Data Science

Finally, is there evidence showing that women are underrepresented in data science? Data science is an interdisciplinary field of computer science, math, and statistics. You may include year and/or degree.

Overall, we believe there is enough graphical evidence that women are underrepresented in data science related fields. In computer science, that disparity is only getting worse with time. Overall, there are fewer people in math and statistics at the BS and MS level; however, even with more overall degrees in math at the PhD level, women still earn about half as many math PhDs as men.
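A claim about a disparity changing over time is easier to check with the female share of degrees per year than with raw counts. The sketch below uses made-up counts for a single field and degree type.

```r
# Made-up counts for one field and degree type across three years
cs <- data.frame(
  Year   = rep(2014:2016, each = 2),
  Sex    = rep(c("Female", "Male"), times = 3),
  Number = c(20, 80, 21, 89, 22, 98)
)

# Female share of degrees per year = female count / total count
tab <- xtabs(Number ~ Year + Sex, data = cs)
female_share <- tab[, "Female"] / rowSums(tab)
round(female_share, 3)
```

In this toy example the number of degrees earned by women rises every year while their share falls, which is why both views are worth reporting.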

4.7 Final brief report

Summarize your findings, focusing on whether we see consistent patterns of more males pursuing science-related fields. Any concerns with the data set? How could we improve on the study?

In general, there are more people overall in Non-S&E fields, and among them there are more women. However, we consistently see that there are more men in the S&E fields. The proportions in Non-S&E fields are relatively similar across degree types, but there is more variability in the S&E degrees. Males and females obtain about the same number of science BS degrees (females slightly more), but males obtain more science MS and PhD degrees. Females appear to earn more degrees overall at the BS and MS level, but males earn more PhDs in STEM fields. Males also earn more STEM MS degrees than females.

We believe there is enough graphical evidence that women are underrepresented in data science related fields. In computer science, that disparity is only getting worse with time. Overall, there are fewer people in math and statistics at the BS and MS level; however, even with more overall degrees in math at the PhD level, women still earn about half as many math PhDs as men.

We can improve on this study by also including people who pursued STEM degrees but did not follow through, or switched to another field.

5 Case study 3: Major League Baseball

We would like to explore how payroll affects performance among Major League Baseball teams. The data is prepared in two formats that record payroll and win counts/percentages by team from 1998 to 2014.

Here are the datasets:

  • MLPayData_Total.csv: wide format
  • baseball.csv: long format

Feel free to use either dataset to address the problems.

5.1 EDA: Relationship between payroll changes and performance

Payroll may relate to performance among ML Baseball teams. One possible argument is that what affects this year’s performance is not this year’s payroll, but the amount that payroll increased from last year. Let us look into this through EDA.

Create increment in payroll

a). To describe the increment of payroll in each year there are several possible approaches. Take 2013 as an example:

- option 1: diff: payroll_2013 - payroll_2012
- option 2: log diff: log(payroll_2013) - log(payroll_2012)

Explain why the log difference is more appropriate in this setup.
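One common argument, offered here as a sketch rather than the expected answer, is that the log difference measures relative change, so teams with very different payroll sizes become comparable. The two teams and their payrolls below are hypothetical.

```r
# Two hypothetical teams whose payrolls both grow by 10%
payroll_2012 <- c(small_team = 50, big_team = 200)  # in millions, made up
payroll_2013 <- payroll_2012 * 1.10

raw_diff <- payroll_2013 - payroll_2012
log_diff <- log(payroll_2013) - log(payroll_2012)

raw_diff  # depends on team size (about 5 vs about 20)
log_diff  # identical for both teams, close to the 10% relative change
```

Because log(a) - log(b) = log(a / b), the log difference depends only on the ratio of the two payrolls, and for small changes it is approximately the percentage change.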


b). Create a new variable diff_log=log(payroll_2013) - log(payroll_2012). Hint: use dplyr::lag() function.

c). Create a long data table including: team, year, diff_log, win_pct
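Steps b) and c) could be sketched as follows with `dplyr::lag()`, assuming long-format data with one row per team and year. The team names and figures are made up, and the column names `team`, `year`, `payroll`, and `win_pct` are assumptions about how the long table is laid out.

```r
library(dplyr)

# Toy long-format data; team names and figures are made up
pay <- data.frame(
  team    = rep(c("A", "B"), each = 3),
  year    = rep(2012:2014, times = 2),
  payroll = c(100, 110, 121, 50, 40, 80),
  win_pct = c(0.50, 0.55, 0.52, 0.45, 0.40, 0.48)
)

pay_long <- pay %>%
  arrange(team, year) %>%  # lag() relies on rows being in year order
  group_by(team) %>%       # so the lag never crosses from one team to another
  mutate(diff_log = log(payroll) - dplyr::lag(log(payroll))) %>%
  ungroup() %>%
  select(team, year, diff_log, win_pct)

pay_long
```

The first year of each team has no previous payroll, so its `diff_log` is `NA`; those rows need to be dropped or handled explicitly in later summaries.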

6.1 Exploratory questions

a). Which five teams had the highest increase in their payroll between years 2010 and 2014, inclusive?

The Los Angeles Dodgers, Miami Marlins, Houston Astros, Kansas City Royals, and the Texas Rangers.

Top five teams with the highest increase in their payroll between years 2010 and 2014, inclusive

| team | year | diff_log | win_pct |
|---|---|---|---|
| Los Angeles Dodgers | 2013 | 0.823 | 0.568 |
| Miami Marlins | 2012 | 0.729 | 0.426 |
| Houston Astros | 2014 | 0.703 | 0.432 |
| Kansas City Royals | 2012 | 0.522 | 0.444 |
| Texas Rangers | 2011 | 0.513 | 0.593 |

b). Between 2010 and 2014, inclusive, which team(s) “improved” the most? That is, had the biggest percentage gain in wins?

The Arizona Diamondbacks, Boston Red Sox, Houston Astros, Cleveland Indians, and Baltimore Orioles.

Top five teams with the biggest percentage gain in wins between years 2010 and 2014, inclusive

| team | year | diff_log | win_pct | diff_log_win |
|---|---|---|---|---|
| Arizona Diamondbacks | 2011 | -0.124 | 0.580 | 0.369 |
| Boston Red Sox | 2013 | -0.139 | 0.599 | 0.341 |
| Houston Astros | 2014 | 0.703 | 0.432 | 0.317 |
| Cleveland Indians | 2013 | -0.008 | 0.568 | 0.302 |
| Baltimore Orioles | 2012 | -0.046 | 0.574 | 0.298 |
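Rankings like the two tables above can be produced by filtering to the year window, sorting, and keeping the first rows. The sketch below uses made-up values and ranks by `diff_log`; ranking by a win-based column would work the same way.

```r
# Toy long table; values are made up
d <- data.frame(
  team     = c("A", "A", "B", "B", "C", "C"),
  year     = c(2013, 2014, 2013, 2014, 2013, 2014),
  diff_log = c(0.10, 0.70, -0.05, 0.30, 0.52, 0.01),
  win_pct  = c(0.50, 0.43, 0.57, 0.60, 0.44, 0.59)
)

# Keep the window of interest, sort by diff_log descending, take the top rows
in_window  <- d[d$year >= 2013 & d$year <= 2014, ]
top_n_rows <- head(in_window[order(-in_window$diff_log), ], n = 2)
top_n_rows
```

This approach ranks individual team-year rows, so it answers "largest single year-over-year change" and a team could appear more than once. Ranking the cumulative change over the whole window would require aggregating by team first.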

6.2 Do log increases in payroll imply better performance?

Log increases in payroll have a very weak linear relationship to performance overall.

Is there evidence to support the hypothesis that higher increases in payroll on the log scale lead to increased performance? Pick a few statistics, accompanied by some data visualization, to support your answer. –> All R^2 values are less than 0.5, indicating a weak linear relationship between a team's log increase in payroll and its performance.
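The R^2 of a simple linear fit, and its link to the correlation, can be computed as in the sketch below. The data is simulated with a deliberately weak relationship and is not the MLB data.

```r
# Simulated data with a deliberately weak relationship (not the MLB data)
set.seed(1)
diff_log <- rnorm(150, mean = 0.05, sd = 0.2)
win_pct  <- 0.5 + 0.03 * diff_log + rnorm(150, sd = 0.07)

# Correlation and the R^2 of the simple linear fit
r   <- cor(diff_log, win_pct)
fit <- lm(win_pct ~ diff_log)
r2  <- summary(fit)$r.squared

c(correlation = r, r_squared = r2)
```

For a regression with a single predictor, R^2 equals the squared correlation, so either number conveys the strength of the linear relationship. A scatter plot with the fitted line is a natural companion to these statistics.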

6.3 Comparison

Which set of factors better explains performance: yearly payroll or yearly increase in payroll? What criterion is being used?

This linear model shows that raw payroll predicts performance more significantly than log payroll, although the R^2 overall is still quite low.
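A comparison of this kind can be made by fitting one model per predictor on the same rows and comparing a common criterion. The sketch below uses simulated data, not the MLB data, and R^2 and AIC are shown as two possible criteria rather than the ones the report necessarily used.

```r
# Simulated team-season data; not the MLB data
set.seed(2)
n        <- 150
payroll  <- runif(n, min = 40, max = 200)     # hypothetical payroll, in millions
diff_log <- rnorm(n, mean = 0.05, sd = 0.2)   # hypothetical log payroll change
win_pct  <- 0.45 + 0.0005 * payroll + rnorm(n, sd = 0.06)

fit_payroll <- lm(win_pct ~ payroll)
fit_diff    <- lm(win_pct ~ diff_log)

# Same response, same rows: compare fit with R^2 and AIC (lower AIC is better)
comparison <- data.frame(
  model     = c("payroll", "diff_log"),
  r_squared = c(summary(fit_payroll)$r.squared, summary(fit_diff)$r.squared),
  aic       = c(AIC(fit_payroll), AIC(fit_diff))
)
comparison
```

Both models must be fit on the same observations for the comparison to be fair. With real data, the first year of each team has no payroll increase, so those rows should be dropped from both fits.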